Abstract

Log-linear models are a well-established method for describing sta- 
tistical dependencies among a set of n random variables. The observed 
frequencies of the n-tuples are explained by a joint probability such 
that its logarithm is a sum of functions, where each function depends 
on as few variables as possible. We obtain for this class a new model 
selection criterion using nonasymptotic concepts of statistical learning
theory. We calculate the VC dimension for the class of k-factor
log-linear models. In this way we are not only able to select the model
with the appropriate complexity, but also obtain statements on the
reliability of the estimated probability distribution. Furthermore we
show that the selection of the best model among a set of models with 
the same complexity can be written as a convex optimization prob- 
lem. 
1 INTRODUCTION

Suppose a scientist is interested in the relation between $n$ features
described by the random variables $X_1, X_2, \dots, X_n$. Observing
dependencies among these features is an important part of all scientific
disciplines. The dependencies may be deterministic, e.g.,

\[ X_n = f(X_1, \dots, X_{n-1}) \]

or (in the generic case) statistical, i.e., the probability of the event
$X_1 = x_1, X_2 = x_2, \dots, X_n = x_n$, for short $X = x$, is not the product
of the probabilities of the $n$ events $X_j = x_j$. All information about
these dependencies is contained in the joint probability distribution $P$
that assigns the probability $P(x) = P(x_1, x_2, \dots, x_n)$ to each n-tuple.
Notice that for large $n$ no "reasonable" sample size is sufficient to
determine all probabilities $P(x_1, x_2, \dots, x_n)$ with good reliability,
since the size of $\Omega := \times_j \Omega_j$ (where each $\Omega_j$ denotes the set of
possible values of $X_j$) is exponentially large in $n$. This shows that we
should not try to learn the joint probability distribution itself; rather,
we have to develop inference rules for learning the statistical
dependencies in a weaker sense. The task is therefore to estimate some
properties of the joint distribution from empirical data of a given size.

In recent years, methods that judge scientific theories by the fit of
their predicted models to real-world data sets have become more and more
important (Hagenaars 1994, Ishii-Kuntz 1997, Pitt and Myung 2002), and
since the accessible computational power is increasing strongly, this
trend will continue in the future.

It is a well-known fact, see e.g. (Akaike 1973, Hansen and Yu
2001), that criteria describing only the goodness of fit of the model to
the data set are not enough to judge the scientific relevance of
the model. Additionally one has to take the complexity of the model
into account. Usually the number of free parameters is treated as a
measure of the model complexity and is incorporated in criteria like
Akaike's Information Criterion (AIC) (Akaike 1973) or the Bayesian
Information Criterion of Schwarz (1978). Bayesian Model Selection
(Kass and Raftery 1995) and Minimum Description Length (Rissanen
1996, Hansen and Yu 2001) also take the functional form of the model
into account. However, these criteria do not give any statements on
the reliability of the estimation.

In this article we derive reliability statements from statistical learn- 
ing theory (Vapnik 1998) and show that the trade-off between good- 
ness of fit and model complexity can be treated with structural risk 
minimization. 

Let us describe roughly the link between the ideas presented here 
and the question of minimal description length in information theory, 
see (Barron 1991, Rissanen 1996) and references therein. In the latter 
approach a risk functional is minimized over a set of probability dis- 
tributions where each probability distribution is penalized according 
to its description length. More explicitly, the risk functional consists 
of two terms: the "empirical risk" evaluates the "goodness of fit" and 
a regularization term penalizes the code length. Barron and Cover 
(1991) obtain bounds on the convergence rate of the risk functional 
in the Hellinger distance. The difference from our approach lies in the
fact that there the regularization function can in principle be chosen
arbitrarily, as long as Kraft's inequality holds, whereas we use a penalty
term that comes directly from the complexity of the considered class of
log-linear models and is based on statistical learning theory. Vapnik and
Chervonenkis (1971) addressed the question of model selection from 
empirical data in a probabilistic setting (see also Vapnik (1999) and 
references therein). They introduced the so-called VC dimension of 
a function set and showed that the VC dimension of the function set 
which the model can implement is a crucial quantity to describe its 
complexity. In general the VC dimension does not agree with the 
usual dimension or the number of free parameters. 

From the statistical learning point of view, a scientific theory delivers
a prior on the class of models under consideration. Then the model
is chosen according to some criterion measuring the fit on the empirical
data. In this work we consider the class of all log-linear models
and prefer models with few statistical dependencies among the random
variables, i.e. we prefer less complex models in terms of statistical
dependence as long as they describe the data well enough. With
this assumption we estimate the model (complexity) directly from the
empirical data rather than testing one model against another using
test methods like the Pearson or likelihood ratio chi-square criterion,
see e.g. (Christensen 1997, Goodman 1978). Akaike (1973) already
formulated this viewpoint; however, he did not have the conceptual tools
of learning theory (Vapnik 1995).
The article is organized as follows. In Section 2 we introduce
log-linear models and explain the prior for model selection leading to a
risk functional on the data. We discuss the commonly used model
selection criteria and explain their shortcomings. We present the idea
of Markov networks and relate them to log-linear models. In Section
3.1 we calculate the VC dimension of the models under consideration
and give the central result about the estimation of the risk functional
for the true probability distribution. In Section 3.2 we use the idea
of structural risk minimization to formulate the model estimation
for log-linear models. In Section 3.3 we show that the optimal model
is the solution of a convex optimization problem.

2 LOG-LINEAR MODELS 

2.1 The Empirical Risk Functional 

In the work presented here we restrict our attention to discrete random
variables and assume that $X$ takes values in a finite set
$\Omega = \times_{j=1}^{n} \Omega_j$.
Assume a scientist tells us that he has found the true joint probability
distribution $P$ to be given by $P_t$, and we would like to test whether
he is right or not. Suppose that his model is the distribution under
which all n-tuples in $\Omega$ occur with equal probability. Assume, in
contrast, that the true distribution $P$ assigns the probability $1/k$ to $k$
specific n-tuples $x_1, x_2, \dots, x_k \in \Omega$ and that these $k$ n-tuples
are chosen without any simple law. Assume furthermore that $k$ is large
compared to the available sample size but small compared to the size of
$\Omega$. Then we have only little chance of recognizing that his model is
wrong. Only if the $k$ n-tuples were selected by a simple law would we
mistrust the scientist's hypothesis. This example illustrates that we need
a quality criterion for models which can be tested on a reasonable sample
size. We suggest using the following criterion. For each n-tuple $x$ in our
data set we assign penalty points depending on how likely its appearance
was according to the hypothetical joint distribution. A good choice,
for instance, is the negative logarithm of the hypothetical probability
$P_t(x)$. Let $T$ be the training set, then we obtain

\[ -\frac{1}{l} \sum_{x \in T} \ln P_t(x) \]

as the "penalty" for the model $P_t$, where $l$ denotes the size of $T$.
Notice that this sum converges in the large sample size limit to

\[ R(P_t) := -\sum_{x} P(x) \ln P_t(x) . \]

Note that $R(P_t)$ is closely related to the Kullback-Leibler distance
(Cover and Thomas 1991) and can never be smaller than the entropy

\[ S(P) := -\sum_{x} P(x) \ln P(x) . \]

Hence, even if the scientist has predicted the true measure exactly, 
he will never get less than S(P) penalty points. If S(P) is high, 
he is not punished for building wrong models but for selecting the 
wrong features, i.e. features with too few dependencies. If the scien- 
tist considers additional random variables ("features influencing the 
considered ones") more specific statements might be possible. Rea- 
sonable research consists not only in observing dependencies among a 
given set of features, it consists also in selecting statistically relevant 
features yielding a low entropy of the joint distribution. Therefore the 
"empirical risk functional" 

\[ R_{emp}(P_t) := -\frac{1}{l} \sum_{x \in T} \ln P_t(x) \qquad (2) \]

is a good measure to test the quality of the distribution $P_t$ for large
sample size. This risk functional is well-known for estimating proba- 
bilities (Vapnik 1998). 
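
The quantities above can be made concrete with a small numerical sketch. The distributions and the sample below are hypothetical illustrations (two binary features), not data from this article; the function names are our own. The sketch evaluates the empirical risk of equation (2), the limiting risk $R$ and the entropy $S$, and shows that the risk of a model never falls below the entropy of the true distribution.

```python
import math

def empirical_risk(sample, model):
    """R_emp(P_t) = -(1/l) * sum of ln P_t(x) over the training sample."""
    return -sum(math.log(model[x]) for x in sample) / len(sample)

def true_risk(true_dist, model):
    """R(P_t) = -sum_x P(x) ln P_t(x)."""
    return -sum(p * math.log(model[x]) for x, p in true_dist.items() if p > 0)

def entropy(dist):
    """S(P) = -sum_x P(x) ln P(x)."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Hypothetical true distribution of two dependent binary features.
true_dist = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# Hypothetical model of the scientist: all tuples equally likely.
uniform = {x: 0.25 for x in true_dist}
# A sample whose relative frequencies happen to match true_dist exactly.
sample = [(0, 0)] * 4 + [(0, 1)] + [(1, 0)] + [(1, 1)] * 4

print("entropy S(P):          ", entropy(true_dist))
print("risk of uniform model: ", true_risk(true_dist, uniform))
print("empirical risk uniform:", empirical_risk(sample, uniform))
print("empirical risk of P:   ", empirical_risk(sample, true_dist))
```

Because the sample frequencies coincide with the true probabilities here, the empirical risk of the true distribution equals its entropy; for a real sample the two only agree in the large sample limit.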

If the space of all probability distributions is too large compared 
to the sample size, we should not minimize the empirical risk over all 
joint distributions. In order to infer a probability distribution from the 
statistical data we give the space a class structure. We use log-linear 
models (Christensen 1997, Goodman 1978) to define suitable classes. 
The simplest class of log-linear models is given in the case when all 
random variables are statistically independent. Then the logarithm of 
the joint probability distribution can be written as a sum of functions, 
each depending on one variable only:

\[ \ln P_t(x_1, x_2, \dots, x_n) = \sum_{j=1}^{n} f_j(x_j) , \]

with $f_j(x_j) = \ln P(x_j)$. The next higher class is given by allowing
two-variable interaction terms:

\[ \ln P_t(x_1, \dots, x_n) = \sum_{i < j \le n} f_{ij}(x_i, x_j) . \]

The next higher class contains terms with three-variable interactions,
etc. Probability distributions with the property that their logarithm
can be expressed as a sum of functions of $k$ variables are called
k-factor distributions. In other words, the data is described by a k-factor
log-linear model. The idea of log-linear models comes from the fact that
some variables influence each other directly, expressed by the interaction
terms, whereas the other dependencies are caused indirectly by
influencing intermediate variables. The hope that real-world data are
well explained by log-linear models can be backed up by the idea that
each variable is influenced directly only by a few other variables, so
that the graph of the variables with edges representing the direct
influences is simple. This idea can be formulated using Markov networks,
see Section 2.3.



2.2 Model Selection and AIC

A simple way of choosing the degree of the interaction terms in a
log-linear model goes as follows. We begin with the simplest model
class and decide on the basis of some significance test whether the
data suggest rejecting the model or not. If so, then we include
two-variable interactions. The same significance test is applied to the new
model. The procedure ends if the frequencies observed in the data and the
probabilities given by the model do not differ significantly according
to the significance test. Let us describe two popular significance tests
(Christensen 1997, Goodman 1978).

Let $\{1, \dots, k\}$ be the probability space consisting of $k$ possible
events. Let $p_1, \dots, p_k$ be the probabilities of the model and
$m_1, \dots, m_k$ the observed frequencies. The sum $l := \sum_j m_j$ is the
sample size. Then we calculate

1. Pearson chi-square test:

\[ X^2 := \sum_j \frac{(m_j - l p_j)^2}{l p_j} . \]

2. Deviance test:

\[ G^2 := 2 \sum_j m_j \ln \frac{m_j}{l p_j} . \]

Remark: $G^2/(2l)$ is the Kullback-Leibler distance (Cover and Thomas
1991) between the probability distribution $(p_j)_{j \le k}$ and the relative
frequencies $(m_j/l)_{j \le k}$. Furthermore the minimizers of $G^2$ and
$R_{emp}$ are the same, i.e., $G^2/(2l) - R_{emp}$ does not depend on
$(p_j)_{j \le k}$.
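
The remark can be checked numerically. The following is a minimal sketch with hypothetical counts and model probabilities; it uses the standard textbook forms of the Pearson statistic $X^2$ and the deviance $G^2$, and expresses the empirical risk in terms of the counts $m_j$. The difference $G^2/(2l) - R_{emp}$ comes out identical for every candidate model, so both criteria share the same minimizer.

```python
import math

def pearson_chi2(counts, probs):
    """X^2 = sum_j (m_j - l p_j)^2 / (l p_j)."""
    l = sum(counts)
    return sum((m - l * p) ** 2 / (l * p) for m, p in zip(counts, probs))

def deviance(counts, probs):
    """G^2 = 2 sum_j m_j ln(m_j / (l p_j)); empty cells contribute zero."""
    l = sum(counts)
    return 2.0 * sum(m * math.log(m / (l * p))
                     for m, p in zip(counts, probs) if m > 0)

def empirical_risk(counts, probs):
    """R_emp = -(1/l) sum_j m_j ln p_j, the risk written in terms of counts."""
    l = sum(counts)
    return -sum(m * math.log(p) for m, p in zip(counts, probs) if m > 0) / l

counts = [30, 50, 20]                      # hypothetical observed frequencies
candidates = ([0.3, 0.5, 0.2], [1 / 3, 1 / 3, 1 / 3], [0.2, 0.6, 0.2])
l = sum(counts)
for probs in candidates:
    gap = deviance(counts, probs) / (2 * l) - empirical_risk(counts, probs)
    print(probs, "X^2 =", pearson_chi2(counts, probs), "gap =", gap)
```

The constant gap is the negative entropy of the relative frequencies, which depends on the data only.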

Under the assumption that $(p_j)_{j \le k}$ is the true probability
distribution, the values of the random variables $X^2$ and $G^2$ are
$\chi^2$-distributed in the large sample limit (Christensen 1997). Using
this argument the model is rejected if the $X^2$ or $G^2$ values are too
large. The range of acceptable values for $X^2$ and $G^2$ is based on the
following rule.

First we have to calculate the number of degrees of freedom (df)
of the model: df is given by the number of possible events minus 1
minus the number of free parameters in the model class. Given df, one
can look up in a table of the $\chi^2$ distribution whether the value of
$X^2$ or $G^2$, respectively, lies outside a certain confidence interval.

However, the idea of rejecting a simple model only if the data
contradict it at the externally given significance level leads to
conservative models. Suppose the chi-square test of a model
yields values of $X^2$ or $G^2$ within a certain confidence interval.
Nevertheless there might be a model with only one more interaction term
such that the $X^2$ or $G^2$ value is considerably decreased. In this case we
will certainly prefer the model that is slightly more complex. This
example illustrates that the externally given significance level
implicitly determines the model selection. Therefore it is crucial to find a
more systematic way of model selection and in this way to prevent over-
and underfitting.

Let $(P_\theta)_{\theta \in \mathbb{R}^k}$ be a set of probability distributions
with $k$ parameters and assume that the true probability distribution $P$
lies in $(P_\theta)_{\theta \in \mathbb{R}^k}$. The true probability distribution
$P$ is the minimizer of $R$, and for a sample size which is large compared
to the size of $(P_\theta)_{\theta \in \mathbb{R}^k}$ the empirical risk
functional $R_{emp}$ is a good approximation of $R$. In this
case we can search for the minimizer of $R_{emp}$ instead. This leads
to the Deviance test above. However, we are interested in the case
where we cannot assure that the data set is sufficiently large. Then
the minimizer of $R_{emp}$ is in general not a good estimate of the
minimizer of $R$.
Akaike (1973) suggested modifying the empirical risk functional
according to the number of free parameters of the model under
consideration. He showed, under some regularity conditions on the mapping
$\theta \mapsto P_\theta$, that

\[ R_{emp} + df / l \]

is a consistent estimator of $R$, i.e., it has the same expectation value
as $R$. However, minimizing $R_{emp} + df/l$ for a finite data set gives us
no information about the true risk $R$. In Section 3.1 we obtain
bounds on the true risk and suggest minimizing this guaranteed risk
rather than the expected risk.

2.3 Why Log-linear Models

In Section 2.1 we have introduced log-linear model classes without giving
any justification for them. In this section we explain why one should
expect that probability distributions generating real-world data can
be described by k-factor models with small $k$. The key idea is that the
distribution of a random variable $X_j$ is only influenced by the values of
a few other variables. This kind of statistical relation can be encoded
by a Markov network.

The presentation of Markov networks in this work follows Pearl
(1988). For each random variable $X_i$ we define the Markov boundary
$B_i \subset \{X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n\}$ as the smallest set
fulfilling

\[ P(x_i | x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i | B_i) . \qquad (3) \]

Observe that this definition does not take any (temporal or causal) 
order of the random variables into account. It will therefore lead to 
an undirected graph. This is in contrast to the directed graphs in 
the context of Bayesian networks (Pearl 1985) which are for instance 
useful to formalize causal structure (Pearl 2000). 

In the case of strictly positive probability distributions the Markov
boundary is unique. Actually, strict positivity of the probability
distributions is also a crucial condition for our approach to learning
probabilities, as we will see in Section 3.1. We now define the Markov
network as the graph $\mathcal{G}$ with $n$ nodes, also denoted by
$X_1, \dots, X_n$, and undirected edges between $X_i$ and $X_j$ if and only if
$X_j \in B_i$ (or equivalently $X_i \in B_j$). In other words, we can
generate the Markov network for a distribution $P$ if we connect each of
the $n$ nodes with the elements of its Markov boundary. As a consequence
we obtain that two random variables $X_i$ and $X_j$, $i \ne j$, are
conditionally independent with respect to a set
$Z \subset X \setminus \{X_i, X_j\}$, i.e.

\[ P(X_i | X_j, Z) = P(X_i | Z) , \]

if and only if every path from $X_i$ to $X_j$ has one node in $Z$. In other
words, if $Z$ blocks every path between $X_i$ and $X_j$ then $X_i$ and $X_j$ are
conditionally independent (with respect to $Z$). A clique in the graph
$\mathcal{G}$ is a subgraph in which all nodes are connected to each other.
We have a (partial) order on the cliques given by inclusion as sets.

Let $\mathcal{G}$ be a Markov graph for a strictly positive probability
distribution $P$ and let $C$ be the set of maximal cliques in $\mathcal{G}$.
Then one can show that $P$ factorizes as follows:

\[ P(x) = \frac{1}{N} \prod_{c \in C} \phi_c(x_c) , \qquad (4) \]

where $X_c \subset X$ is the set of random variables in the clique $c$ and
$x_c$ denotes their values, $N$ is the normalization constant and the
$\phi_c$ are the compatibility functions w.r.t. the clique $c$, also called
the potential functions.

Suppose two distributions have the same Markov graph; then they
describe the same (in)dependencies in $X$. In Section 3.1 we will see
that the maximal number of nodes in a clique,
$\deg(\mathcal{G}) := \max_{c \in C} \# c$, is an appropriate parameter to
subdivide the set of all distributions $\mathcal{P}(X)$ into small model
classes.
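
To illustrate how $\deg(\mathcal{G})$ can be read off a Markov network, here is a minimal sketch that enumerates the maximal cliques of a small undirected graph with the basic Bron-Kerbosch recursion. The algorithm choice and the example graph (given as hypothetical Markov boundaries for five variables) are our own illustration and are not taken from this article.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of an undirected graph.

    adj maps each node to the set of its neighbours (its Markov boundary).
    Basic Bron-Kerbosch recursion without pivoting.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already handled
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

# Hypothetical Markov boundaries: a chain 0 - 1 - 2 attached to a triangle 2, 3, 4.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3, 4}, 3: {2, 4}, 4: {2, 3}}
cliques = maximal_cliques(adj)
degree = max(len(c) for c in cliques)   # deg(G): size of the largest clique
print(sorted(sorted(c) for c in cliques), "deg(G) =", degree)
```

For this graph the largest clique has three nodes, so by the factorization in eq. (4) a distribution with this Markov network needs interaction terms of at most three variables.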

Consider the case that every clique has fewer than $k + 1$ elements.
Then the distribution can be described by a k-factor model due to the
factorization in eq. (4). The Markov boundary can be understood in
a rather literal sense if the variables represent quantities that
are measured at different positions in real space. Then the belief
that one obtains probability distributions corresponding to simple Markov
networks stems from the locality principle: distant variables influence
each other only indirectly via other variables.

In statistical physics log-linear models can be justified even more
directly. Assume the variable $X_j$ describes the physical state of
particle $j$ and that the total energy of the system in the state
$x = (x_1, \dots, x_n)$ is a sum of functions depending on at most $k$
particles, i.e.,

\[ E(x) = \sum_{j} E_j(x_{j_1}, \dots, x_{j_k}) , \]

where $j$ runs over all k-subsets of $\{1, \dots, n\}$. Then the
thermodynamic equilibrium state, the so-called Gibbs distribution, is up to
a normalization factor directly given by the exponential of the negative
energy function. Hence the equilibrium distribution is exactly
described by a k-factor model. In statistical physics one usually
considers two-particle interactions only, therefore 2-factor models
describe the equilibrium state.

Now we have justified why the class structure of log-linear models
is natural. However, we have to refine our hierarchy in some respect.
So far, we did not formulate any restrictions on the range of the
probability distribution of the models. Intuitively it is clear that we can
assign a small probability to an event $x \in \Omega$ only if we have a very
large data set. This means that reasonable model selection should
allow small probabilities only for large data sets. However, the
empirical risk functional $R_{emp}$ penalizes small probabilities only if
they are contained in the data set. Therefore the value of $R_{emp}$ depends
strongly on the specific data set if the probability distribution is not
bounded from below by a strictly positive constant. Actually, for
deriving a bound for the risk functional $R$ in Section 3.1 we need
precisely this kind of bound on the probability.

Definition 1 Let $\mathcal{P}_\lambda$ be the set of all strictly positive
functions greater than $\lambda > 0$. Let $\mathcal{P}^k$ be the set of
probability distributions $P$ corresponding to a k-factor model, i.e.,

\[ P(x_1, x_2, \dots, x_n) = \prod_{j} q_j(x_{j_1}, x_{j_2}, \dots, x_{j_k}) , \]

where each $q_j$ is a positive function and $j := (j_1, j_2, \dots, j_k)$ runs
over all possible subsets of $\{1, 2, \dots, n\}$ with $k$ elements. Then we
define the model class of degree $k$ by

\[ \mathcal{H}^k_\lambda = \mathcal{P}_\lambda(X) \cap \mathcal{P}^k . \qquad (5) \]

Remark: The values $q_j(x_{j_1}, x_{j_2}, \dots, x_{j_k})$ cannot be interpreted
as probabilities since they can be greater than 1.



3 LEARNING LOG-LINEAR MODELS

It is a well-known problem in any mathematical theory of learning
that a model can explain the training data well but fail to fit the
data observed in the future if the model class is too large. If the model
class is small enough, one knows with high confidence that the model
which minimizes the error on the training data will also explain the
future data almost optimally within this model set.

3.1 Risk estimation 

For a set of real-valued functions on an arbitrary set $\Omega$ a relevant
measure for the size of this set is the so-called VC dimension
(Vapnik-Chervonenkis). First, let us define the VC dimension (Vapnik 1998)
of a set of indicator functions $(f_\alpha)$, i.e. $f_\alpha : \Omega \to \{0, 1\}$.

Definition 2 Let $\Lambda$ be an index set of arbitrary cardinality. Let
$(f_\alpha)_{\alpha \in \Lambda}$ be a set of indicator functions on $\Omega$.
Then the VC dimension of $(f_\alpha)_{\alpha \in \Lambda}$ is the largest number
$h$ such that there exist $h$ points $x_1, x_2, \dots, x_h \in \Omega$ with the
property that for every function $\chi : \{x_1, \dots, x_h\} \to \{0, 1\}$
there exists a function $f_\alpha$ such that its restriction to
$\{x_1, x_2, \dots, x_h\}$ coincides with $\chi$.

The VC dimension for a set of real-valued functions is defined by the 
VC dimension for the set of corresponding indicator functions. 

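Definition 3 Let $(f_\alpha)_{\alpha \in \Lambda}$ be a family of real-valued
functions on a set $\Omega$. Then the VC dimension of
$(f_\alpha)_{\alpha \in \Lambda}$ is the VC dimension of the family of the
indicator functions
$(\chi_\mu \circ f_\alpha)_{\mu \in \mathbb{R}, \alpha \in \Lambda}$, where
$\chi_\mu$ is defined by $\chi_\mu(x) = 0$ for $x < \mu$ and $\chi_\mu(x) = 1$
for $x \ge \mu$.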
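
For finite families on a finite domain, Definition 2 can be checked by brute force. The sketch below is an illustration under our own assumptions: the two example families (threshold functions and interval indicators on four points) are hypothetical and serve only to show what shattering means.

```python
from itertools import combinations

def shatters(functions, points):
    """True if the family realizes all 2^h labelings of the given points."""
    patterns = {tuple(f(x) for x in points) for f in functions}
    return len(patterns) == 2 ** len(points)

def vc_dimension(functions, domain):
    """Largest h such that some h-point subset of the domain is shattered."""
    h = 0
    for size in range(1, len(domain) + 1):
        if any(shatters(functions, pts) for pts in combinations(domain, size)):
            h = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return h

domain = [0, 1, 2, 3]
# Indicator functions chi_t(x) = 1 for x >= t (thresholds as in Definition 3).
thresholds = [lambda x, t=t: int(x >= t) for t in range(5)]
# Indicators of intervals [a, b], plus the zero function.
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in range(4) for b in range(a, 4)] + [lambda x: 0]

print("thresholds:", vc_dimension(thresholds, domain))
print("intervals: ", vc_dimension(intervals, domain))
```

Thresholds cannot label a smaller point 1 and a larger point 0, so no pair of points is shattered; intervals shatter pairs but cannot produce the labeling (1, 0, 1) on three points.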

The notion of VC dimension plays a crucial role in statistical learning
theory. It is a measure which indicates whether a learning machine
overfits or not. Roughly speaking, if a family of functions
which can be implemented by a learning machine has small VC
dimension, then every function which fits the training data well will
with high probability fit the test data as well. The following theorem is
a corollary of Theorem 5.1, p. 192 in (Vapnik 1998).

Theorem 4 Let $(f_\alpha)_{\alpha \in \Lambda}$ be a measurable set of bounded
real-valued functions on $\Omega$, $A \le f_\alpha(z) \le B$, and let the set of
indicator functions have finite VC dimension $h$. Let $\mu$ be a probability
distribution on $\Omega$ and let the data $x_1, x_2, \dots, x_l$ be drawn
according to $\mu$ (independent and identically distributed). Then we have
with probability at least $1 - \eta$



where $R_{emp}(f_\alpha) := \frac{1}{l} \sum_{j \le l} f_\alpha(x_j)$ is the
empirical risk and $R(f_\alpha) := \int f_\alpha(x) \, d\mu(x)$ the true risk.

Remark: We have $R(P_\alpha) = R(-\ln P_\alpha)$ and
$R_{emp}(P_\alpha) = R_{emp}(-\ln P_\alpha)$ when $(P_\alpha)$ is a set of
probability distributions. Therefore we can use this theorem to calculate
the generalization error of equation (2) for the family of probability
distributions defined in equation (5).

First, let us interpret the generalization error in this context.
Suppose we have found a probability distribution $P_t$ in some model class
that minimizes $R$; then we know that within this class $P_t$ is the closest
distribution to the true distribution in the Kullback-Leibler distance.
Furthermore, we can conclude with high confidence that future data
will fit this distribution as well.

Next we estimate the VC dimension of product distributions since
this will be the leading intuition for estimating the VC dimension of
the model class $\mathcal{H}^k_\lambda$.

Lemma 1 Let $X_1, X_2, \dots, X_n$ be discrete random variables where $X_i$
takes $m_i$ different values in $\Omega_i$ for $i = 1, \dots, n$. Then the VC
dimension of the family of all product distributions
$\mathcal{P}^1 = \{P(x_1, \dots, x_n) = P(x_1) \cdots P(x_n) : P$ is a
probability distribution$\}$ is given by

\[ 1 + \sum_{i=1}^{n} m_i - n . \]

Remark: For the families $(\ln P)_{P \in \mathcal{P}^1}$ and
$(P)_{P \in \mathcal{P}^1}$ the sets of indicator functions coincide. Hence
$(\ln P)_{P \in \mathcal{P}^1}$ and $(P)_{P \in \mathcal{P}^1}$ have the same
VC dimension.

Proof: Let $x_{j,0}$ be an arbitrary element of the set $\Omega_j$. Due to
$P(x) = P(x_1) \cdots P(x_n)$ we can write the logarithm of the joint
probability as

\[ \ln P(x) = \sum_j \ln\big(P(x_j) / P(x_{j,0})\big) + \sum_j \ln P(x_{j,0}) . \]

We can characterize an n-tuple $x = (x_1, x_2, \dots, x_n)$ uniquely by an
$N = \sum_j (m_j - 1)$ dimensional vector with entries 0 and 1 as follows:
The vector consists of $n$ blocks of dimension $(m_j - 1)$ characterizing
$x_j$. Let $x_{j,0}, x_{j,1}, \dots, x_{j,m_j-1}$ be the elements of $\Omega_j$
in an arbitrary ordering. For $x_j = x_{j,i}$ with $i \ne 0$ let the i-th
entry of the j-th block be 1 and the other entries of the j-th block be
zero. For $x_j = x_{j,0}$ let all entries of the j-th block be zero.

For each n-tuple the logarithm of its probability is given as follows.
Let $c_x \in \mathbb{R}^N$ be the vector corresponding to
$x := (x_1, x_2, \dots, x_n)$. Then we have

\[ \ln P(x) = \sum_{r} c_{x,r} f_r + d = \langle c_x, f \rangle + d , \]

where the coordinates $f_r$ of $f$ are given by the values
$\ln(P(x_{j,i}) / P(x_{j,0}))$ with $j \le n$ and $1 \le i \le m_j - 1$. The
constant term $d$ is defined as $d := \sum_j \ln P(x_{j,0})$. This shows that
$\ln P(x)$ can be written as an affine functional in $\mathbb{R}^N$. Hence
the VC dimension of $\mathcal{P}^1$ is at most $N + 1$ (see
(Vapnik 1998), Example in Chapter 5.2.3). To show that it is not
smaller than $N + 1$, note that there is no restriction at all on the set
of possible vectors $f$ if $d$ is chosen appropriately. This can be seen
as follows. Let $g_1, g_2, \dots, g_{m_j-1}$ be the entries of $f$ in the j-th
block. Then set

\[ P(x_{j,i}) := \frac{\exp(g_i)}{\sum_{1 \le i' \le m_j - 1} \exp(g_{i'}) + 1} \]

and

\[ P(x_{j,0}) := \frac{1}{\sum_{1 \le i \le m_j - 1} \exp(g_i) + 1} . \]

Now we have to show that one can find $N + 1$ vectors, each corresponding
to an n-tuple, that can be classified in all $2^{N+1}$ possible ways.
Consider the vectors $e_j$ having 1 at position $j$ and 0
elsewhere. Obviously, each vector $e_j$ corresponds to a possible n-tuple.
As the $(N + 1)$-th vector we choose the origin $(0, 0, \dots, 0)$. It
corresponds to the n-tuple $(x_{1,0}, x_{2,0}, \dots, x_{n,0})$. These vectors
can be classified in all $2^{N+1}$ possible ways as follows. Choose an
arbitrary indicator function $\chi$ on these $N + 1$ points and set
$f := (\pm 1, \pm 1, \dots, \pm 1)$ with positive sign at position $j$ if and
only if $\chi(e_j) = 1$. Define $d$ as above in order to ensure that

\[ x \mapsto \langle c_x, f \rangle + d \]

defines the logarithm of a probability function. Set $a := d \pm 1/2$ and
choose the negative sign if and only if $\chi((0, 0, \dots, 0)) = 1$, so that
the value $d$ attained at the origin lies above the threshold $a$ exactly
in this case, while the values $d \pm 1$ attained at the $e_j$ are classified
according to the sign of $f$. Then the indicator function $\chi_a \circ w$
with $w(x) := \langle c_x, f \rangle + d$ coincides with the
desired classifier $\chi$. Hence the VC dimension of product distributions
is $N + 1$. □
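
The encoding used in the proof can be verified numerically. The following sketch assumes, for illustration only, that the values of each variable are labelled $0, \dots, m_j - 1$ with $0$ playing the role of $x_{j,0}$; the marginals are hypothetical. It builds the 0/1 vector $c_x$, the parameter vector $f$ and the constant $d$, and compares the affine form $\langle c_x, f \rangle + d$ with the directly computed $\ln P(x)$.

```python
import math
from itertools import product

def encode(x, sizes):
    """0/1 vector c_x: block j has m_j - 1 entries, entry i-1 is 1 iff x_j = i >= 1."""
    c = []
    for xj, m in zip(x, sizes):
        block = [0] * (m - 1)
        if xj > 0:
            block[xj - 1] = 1
        c.extend(block)
    return c

# Hypothetical marginals of a product distribution (m_1 = 3, m_2 = 2).
marginals = [[0.5, 0.3, 0.2], [0.6, 0.4]]
sizes = [len(p) for p in marginals]

# f collects ln(P(x_{j,i}) / P(x_{j,0})), d collects ln P(x_{j,0}).
f = [math.log(p[i] / p[0]) for p in marginals for i in range(1, len(p))]
d = sum(math.log(p[0]) for p in marginals)

def log_prob_affine(x):
    return sum(ci * fi for ci, fi in zip(encode(x, sizes), f)) + d

def log_prob_direct(x):
    return sum(math.log(p[xj]) for p, xj in zip(marginals, x))

for x in product(*[range(m) for m in sizes]):
    print(x, encode(x, sizes), log_prob_affine(x), log_prob_direct(x))
```

Here $N = (3 - 1) + (2 - 1) = 3$, so $f$ has three coordinates and the lemma gives VC dimension $N + 1 = 4$ for this family.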

Now we give a bound for the VC dimension of the model classes
introduced in Section 2 (Definition 1).

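Lemma 2 The VC dimension of $\mathcal{P}^k$ is at most

\[ h_k := \sum_{j} m_{j_1} m_{j_2} \cdots m_{j_k} , \]

where $j = (j_1, \dots, j_k)$ runs over all k-element subsets of
$\{1, \dots, n\}$ and $m_i$ is the size of the set $\Omega_i$.

Remark: $h_k$ is at most $O(n^k)$ if $m_i$ is uniformly bounded.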

Proof: Let $P \in \mathcal{P}^k$. Since $P$ factorizes we can characterize the
expression $\ln P$ uniquely by a vector $f \in \mathbb{R}^{h_k}$ as follows:
the vector $f$ consists of $\binom{n}{k}$ blocks where the blocks are indexed
by $j$ as given in the Lemma. The block $j$ has size $\prod_i m_{j_i}$ and the
entries within the block have the values

\[ \ln q_j(x_{j_1}, x_{j_2}, \dots, x_{j_k}) , \]

where $(x_{j_1}, x_{j_2}, \dots, x_{j_k})$ runs over all elements of
$\Omega_{j_1} \times \Omega_{j_2} \times \dots \times \Omega_{j_k}$.
Each n-tuple $x = (x_1, x_2, \dots, x_n)$ can be characterized by a vector
$c_x \in \mathbb{R}^{h_k}$ as follows: In each block $j$ there is exactly one
coordinate that corresponds to the k-tuple
$(x_{j_1}, x_{j_2}, \dots, x_{j_k})$. Set this entry to 1 and
the other entries of the block to 0. Then $\ln P(x)$ is given by the inner
product of the vectors $f$ and $c_x$. We obtain the lemma since the VC
dimension of the set of linear functionals on $\mathbb{R}^h$ is $h$, see
(Vapnik 1998).
□
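
The bound $h_k$ of Lemma 2 is easy to evaluate. The sketch below is a small illustration; the function name `vc_bound` and the example alphabet sizes are our own assumptions. It sums the products of the alphabet sizes over all k-element subsets of the variables.

```python
from itertools import combinations
from math import comb, prod

def vc_bound(sizes, k):
    """h_k = sum over all k-subsets j of m_{j_1} * ... * m_{j_k}."""
    return sum(prod(sizes[i] for i in subset)
               for subset in combinations(range(len(sizes)), k))

# Hypothetical example: variables with 2, 3 and 4 possible values.
sizes = [2, 3, 4]
print([vc_bound(sizes, k) for k in (1, 2, 3)])

# For n binary variables the bound reduces to C(n, k) * 2^k.
n = 10
print([vc_bound([2] * n, k) for k in (1, 2, 3)])
```

For uniformly bounded alphabet sizes the bound grows polynomially in $n$ for fixed $k$, in line with the remark after the lemma, whereas the number of n-tuples grows exponentially.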

Obviously, the model class $\mathcal{H}^k_\lambda$ is a subset of
$\mathcal{P}^k$ and so we can bound the VC dimension of
$\mathcal{H}^k_\lambda$ from above by $h_k$, which is independent of
$\lambda$. Therefore Lemma 2 and Theorem 4 yield the following
theorem.

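Theorem 5 Let $\mathcal{H}^k_\lambda$ be the model class of degree $k$ and
$h_k$ be the upper bound on the VC dimension of $\mathcal{H}^k_\lambda$ given
in Lemma 2. Then for training data $x_1, x_2, \dots, x_l$ we have for
$P_\alpha \in \mathcal{H}^k_\lambda$ with probability at least $1 - \eta$

\[ R(P_\alpha) \le R_{emp}(P_\alpha) + \phi(k, \lambda, \eta) , \qquad (6) \]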
with (j)(k, A, 77) := - In A ^ fa*-lnfa fc +lnl6+lni-ln^ 

The task of finding the minimizer of $R_{emp}$ in $\mathcal{H}^k_\lambda$,
even for fixed $\lambda$ and $k$, is computationally expensive. We will not
treat this question here. Theorem 5 tells us how to estimate the risk
$R(P_\alpha)$ for an a priori chosen model class $\mathcal{H}^k_\lambda$.
However, we do not know which class to select. This
question is addressed in the next section.

3.2 Estimating Log-linear Models from Empirical Data

We do not want to exclude k-factor models with large $k$ a priori if
the data strongly indicate that such a complex model is appropriate.
Similarly, we do not want to exclude distributions with small $\lambda$. We
rather want to avoid such models if the data do not give us strong
enough reason for choosing them. To solve this problem we use a
generalized version of the structural risk minimization principle
(Vapnik 1998, Vapnik 1995). In this way the result of Theorem 5 can be
extended such that $k$ and $\lambda$ need not be chosen in advance. For
doing so we have to define a prior probability measure $\nu$ on the set of
possible pairs $(k, \lambda)$ in advance in order to get a bound on the true
risk.

From Theorem 5 we obtain with a standard union bound argument,
see e.g. Lemma 4.1 in (Herbrich 2002), the following corollary.

Corollary 1 Let $(\lambda_n)_{n \in \mathbb{N}}$ be a strictly monotone
decreasing sequence converging to zero. Let $\nu$ be a strictly positive
probability measure on $\mathbb{N}^2$ and
$A_{kn} = \{1, \dots, k\} \times \{1, \dots, n\}$. Then for training data
$x_1, x_2, \dots, x_l$ we have with probability at least $1 - \eta$

\[ R(P) \le R_{emp}(P) + \phi(k, \lambda_n, \eta \, \nu(A_{kn})) , \]

for $P \in \mathcal{H}^k_{\lambda_n}$ and $\phi$ defined above.

The probability measure $\nu$ in Corollary 1 has to be chosen before
seeing the training data. Usually one chooses $\nu$ to be inversely
proportional to $\phi$. However, it should be emphasized that the statement
in Corollary 1 does not assume that the measure $\nu$ is chosen according
to any prior probabilities for the possible probability measures as used
in Bayesian learning.

Now we can formulate structural risk minimization as follows:

\[ \text{Minimize}_{(k,n) \in \mathbb{N}^2} \quad t + \phi(k, \lambda_n, \eta \, \nu(A_{kn})) \]

\[ \text{subject to} \quad t = \min_{P \in \mathcal{H}^k_{\lambda_n}} R_{emp}(P) \]
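
The outer loop of this minimization can be sketched as follows. This is only an illustration of the selection step under strong assumptions: the minimal empirical risks per class are taken as given (hypothetical numbers), and the penalty is passed in as a function. The toy penalty used in the example is a placeholder and is not the function $\phi$ of Theorem 5.

```python
def structural_risk_minimization(min_emp_risk, penalty):
    """Pick the class index (k, n) minimizing t + penalty(k, n).

    min_emp_risk maps (k, n) to t = min R_emp over the class with these indices.
    penalty(k, n) stands in for the complexity term of the risk bound.
    """
    best = None
    for (k, n), t in min_emp_risk.items():
        bound = t + penalty(k, n)
        if best is None or bound < best[0]:
            best = (bound, k, n)
    return best

# Hypothetical minimal empirical risks: richer classes fit the sample better.
min_emp_risk = {(1, 1): 1.30, (2, 1): 1.10, (3, 1): 1.05}

# Placeholder penalty growing with k; purely illustrative, NOT phi from Theorem 5.
def toy_penalty(k, n):
    return 0.1 * k

bound, k, n = structural_risk_minimization(min_emp_risk, toy_penalty)
print("selected k =", k, "n =", n, "guaranteed risk =", bound)
```

With these numbers the improvement in fit from $k = 2$ to $k = 3$ is smaller than the increase of the penalty, so the intermediate class is selected.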
3.3 Model Computation by Convex Optimization

In this section we show that the minimization of the empirical risk
$R_{emp}$ is a convex optimization problem in the class of all k-factor
models for fixed $k$ and $\lambda$. We formulate the optimization problem
for $\ln P$ rather than for $P$. Let $x^1, \dots, x^l$ be the training data.
For each set $j = \{j_1, \dots, j_k\} \subset \{1, \dots, n\}$ let $V_j$ be the
vector space of real-valued functions on the set
$\Omega_j := \Omega_{j_1} \times \dots \times \Omega_{j_k}$. Define

\[ V^k := \oplus_j V_j , \]

where $j$ runs over all subsets of $\{1, \dots, n\}$ of size $k$. Now fix $k$
and $j$ of size $k$. The set $\Omega_j$ is discrete and so the functions

\[ e_{w_1, w_2, \dots, w_k}(v_1, v_2, \dots, v_k) :=
   \begin{cases} 1 & \text{if } w_1 = v_1, \dots, w_k = v_k \\ 0 & \text{else} \end{cases} \]

form a canonical basis of $V_j$. Further let $x_j$ be the k-tuple given by the
restriction of $x \in \Omega$ to the variables $X_{j_1}, \dots, X_{j_k}$, and
$x^i_j$ the restriction of the training data point $x^i$ to the same
variables.

Now, let us formulate the optimization problem for the empirical
risk. Find a vector $f = \oplus_j f_j \in V^k$ with $f_j \in V_j$ such that

\[ R_{emp}(f) := -\frac{1}{l} \sum_{i=1}^{l} \sum_{j} f_j(x^i_j) \]

is minimal subject to the constraints

\[ Z(f) := \sum_{x \in \Omega} \exp\Big( \sum_j f_j(x_j) \Big) = 1 \]

and

\[ G_x(f) := \sum_j f_j(x_j) \ge \log \lambda \quad \text{for all } x \in \Omega . \]

Then the probability measure $P \in \mathcal{H}^k_\lambda$ is given by
$\ln P(x) = \sum_j f_j(x_j)$. The functions
$R_{emp}$ and $G_x$ are obviously linear in $f$. The function $Z$ is convex in
$f$ since the exponent is linear in $f$ and the exponential function is
convex. Clearly, the sum of the convex functions

\[ f \mapsto \exp\Big( \sum_j f_j(x_j) \Big) \]

over all $x$ is convex. Therefore the problem can be solved with
standard convex optimization methods (Pallaschke and Rolewicz 1997).
However, for a large number of variables the solution becomes
computationally expensive. Notice that the number of constraints
$G_x(f) \ge \log \lambda$ increases exponentially with $n$. Furthermore the
calculation of $Z(f)$ involves the summation over exponentially many
possible n-tuples $x$.
This problem is similar to the calculation of the partition function in
statistical mechanics, which requires a summation over exponentially
many states of a many-particle system. These computational problems
are not different from those appearing in conventional approaches
to log-linear model selection.
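
For very small $\Omega$ the fit can be computed by brute force. The sketch below is not the constrained formulation above: as a simplification of our own, it drops the lower bound $\lambda$ and replaces the constraint $Z(f) = 1$ by subtracting $\ln Z(f)$, i.e. it minimizes $R_{emp}(f) + \ln Z(f)$ by plain gradient descent. The function name, step size and sample are hypothetical. The sum over all of $\Omega$ in every step makes the exponential cost mentioned above explicit.

```python
import math
from itertools import product

def fit_log_linear(sample, sizes, subsets, steps=2000, lr=0.5):
    """Fit ln P(x) = sum_j f_j(x_j) - ln Z(f) by gradient descent.

    sample: list of tuples, sizes: number of values per variable,
    subsets: list of index tuples j defining the interaction terms.
    Simplification: no lower bound lambda on the probabilities is enforced.
    """
    omega = list(product(*[range(m) for m in sizes]))

    def keys(x):
        # one parameter f_j(x_j) per subset j and assignment x_j
        return [(j, tuple(x[i] for i in j)) for j in subsets]

    f = {key: 0.0 for x in omega for key in keys(x)}
    l = len(sample)
    emp = dict.fromkeys(f, 0.0)          # empirical frequency of each x_j
    for x in sample:
        for key in keys(x):
            emp[key] += 1.0 / l

    def distribution():
        weights = [math.exp(sum(f[key] for key in keys(x))) for x in omega]
        z = sum(weights)                 # partition function Z(f)
        return {x: w / z for x, w in zip(omega, weights)}

    for _ in range(steps):
        model = dict.fromkeys(f, 0.0)    # model probability of each x_j
        for x, p in distribution().items():
            for key in keys(x):
                model[key] += p
        for key in f:                    # gradient of R_emp + ln Z is model - emp
            f[key] += lr * (emp[key] - model[key])
    return distribution()

# Hypothetical sample of two binary variables; product model (k = 1).
sample = [(0, 0)] * 4 + [(0, 1)] * 2 + [(1, 0)] * 1 + [(1, 1)] * 3
fitted = fit_log_linear(sample, sizes=[2, 2], subsets=[(0,), (1,)])
print(fitted)
```

For $k = 1$ the minimizer is known in closed form, namely the product of the empirical marginals, which gives a simple check of the numerical result.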

4 CONCLUSION 

We used structural risk minimization from statistical learning theory to
obtain a new selection criterion for log-linear models. In this way we
have not only a criterion to choose a suitable model complexity, but
we also found a bound on the actual risk, i.e. the key feature of our
approach is that it provides statements of the form "if a log-linear
model of certain simplicity fits the observed data well, then we are
guaranteed with high probability that it will fit future observations as
well". This kind of nonasymptotic statistics is becoming more and more
important in research areas like cognitive and social science, where
the number of possibly relevant features for the system under
investigation is large. It offers a new way of developing models, namely,
inferring statistical dependencies among a large number of features
from data.

We structured the class of log-linear models according to the degree
of the interaction terms. We believe that the obtained bounds can be
improved by using measures of complexity other than the VC
dimension, such as Gaussian complexity (Bartlett and Mendelson 2001).
The structure of the model classes can be refined, for example by taking
prior knowledge into account. In the future this model selection criterion
needs to be tested on real-world data and the results compared to
existing ones.